pacman::p_load(sf, tidyverse, tmap, spdep, funModeling)In-class Exercise 2
Overview
Water is a scarce resource. Goal 6 of the United Nations’ (UN’s) Sustainable Development Goals (SDGs) is to ensure availability and sustainable management of water and sanitation for all.
Getting Started
Water Point Data
We obtained the global water point and small water scheme level data from Water Point Data Exchange (WPdx) Global Data Repositories (WPdx, 2020). We accessed the WPdx-Plus (WPdx+) option and downloaded the full Shapefile under the Export option. As the data consists of water points around the world, we will later filter for water points within Nigeria in R in a subsequent step.
After downloading the Shapefile which can take a few minutes due to the large file size, unzip the folder and copy the Shapefiles (.dbf, .prj, .shp and .shx) into a data subfolder that shares the same directory as this Quarto file for ease of calling the files. We also want to rename all four files to geo_export so that we can reference these filenames more easily when we import the data.
3.2. Geographical Boundaries of Nigeria
We also need the geographical boundaries of Nigeria to make meaningful sense of its water point locations and to aid spatial visualisation. Here, we downloaded the Level-2 Administrative Boundaries (also known as Local Government Area (LGA)) data (ADM2) for Nigeria in year 2020 from geoBoundaries, the largest open and free database of political administrative boundaries globally (geoBoundaries, 2022). One can filter for Nigeria’s data by typing it in under the Name filter, followed by clicking on the download button under the column geoBoundaries, sub-column Full Release and for the row Nigeria, NGA, ADM2, 2020.
Similar to the water point data, we unzip the folder and copy the Shapefiles (.dbf, .prj, .shp and .shx) into the same folder as the water points Shapefiles. Here, we rename the files to geoBoundaries-NGA-ADM2 to indicate the data source (geoBoundaries), country (NGA) and administrative boundary level (ADM2).
4. Installing and Loading Packages in R
The code chunk below uses p_load() from pacman package to brings in the R packages for:
- Spatial vector data encoding (sf);
- Data-wrangling (tidyverse);
- Map plotting (tmap);
- Geospatial analysis (spdep); and
- Rapid Exploratory Data Analysis (EDA) (funModeling).
5. Importing Geospatial Data in R
5.1. Water Point Geospatial Data
We are now ready to import the geospatial data into the Quarto document. The code chunk below does so for the water point data by using st_read() function of the sf package. We specified the data source name (dsn) or directory of the file ("data/geospatial"), layer for the name of the Shapefiles ("geo_export"), and crs = 4326 to import the data in wgs84 geographic coordinate reference system (CRS), since the Shapefile is in wgs84. We also pipe a filter to obtain data that are in Nigeria only, by using the filter() function of dplyr package from tidyverse. The clean_country_name column is used for the filter, and note that the column name is truncated in the Shapefile due to character limit and should be keyed in correctly to perform the filter successfully.
wp <- st_read(dsn = "data/geospatial",
layer = "geo_export",
crs = 4326) %>%
filter(clean_coun == "Nigeria")One point to note is that while we can theoretically transform the data to projected CRS directly by using the st_transform() function of sf to facilitate the accurate computation of distances in a planar configuration, we want to keep it on hold for now as it will result in missing data points when we use st_intersects() subsequently to identify water points within each administrative boundary. This is because st_intersects() only works correctly if the geospatial data are in geographic CRS.
The simple feature data frame comprises 95,008 observations of 73 variables. In particular, we are interested in the variable status_clean (truncated to status_cle in the Shapefile), which tells us which water points are functional versus not. In addition, we will use the last variable, geometry, to perform data join for the recoded variables to the LGA boundaries data.
On a practical note, to avoid taking up too much memory space in GitHub, which has a memory limit of 100MB, we will extract the necessary data and save them in an rds file, and delete the geo_export Shapefiles from the data/geospatial folder, before committing and pushing the changes to GitHub. This is to prevent error in the process of pushing the commit to GitHub. We do so by running the relevant code chunks below and saving the rds file in the data/spatial folder, and then setting #| eval: false so that the codes that use the original Shapefiles and intermediate large files will not run when knitted. This way, those codes will be suppressed when rendering the Quarto file and analysis can be done using the eventual rds file.
Should we wish to run certain lines of codes that are suppressed, we can set to
#| eval: trueto allow normal evaluation during rendering, or run it manually in the RStudio environment.
In the code chunk below, write_rds() of the readr package is used to save the extracted sf data table into an output file in rds data format. We then do not need to go back to the original Shapefile to reload the full set of global water points data each time we use it, as the data size is very large, the time to load is long and it cannot be pushed to GitHub.
wp_nga <- write_rds(wp, "data/geospatial/wp_nga.rds")However, do note that after running the above code chunk, the wp_nga.rds file is still too large (140.2MB) to push to GitHub (100MB limit). Hence, we will further extract only the data that we wish to use for our analysis and save it as another .rds file, and remove this one, indicate #| eval: false and delete the wp_nga.rds file from our directory, before we commit and push the changes to GitHub.
5.2. Nigeria Level-2 Administrative Boundary Geospatial Data
We also import the Nigeria Level-2 Administrative Boundary (LGA) data into our Quarto file, similarly using st_read() of sf in the code chunk below. The data are saved in the form of a simple feature data table nga.
nga <- st_read(dsn = "data/geospatial",
layer = "geoBoundaries-NGA-ADM2",
crs = 4326)Reading layer `geoBoundaries-NGA-ADM2' from data source
`C:\zhuyiting1\ISSS624\In-class_Ex\In-class_Ex2\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS: WGS 84
glimpse(nga)Rows: 774
Columns: 6
$ shapeName <chr> "Aba North", "Aba South", "Abadam", "Abaji", "Abak", "Abaka…
$ Level <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ shapeID <chr> "NGA-ADM2-72505758B79815894", "NGA-ADM2-72505758B67905963",…
$ shapeGroup <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NG…
$ shapeType <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((7.401109 5...., MULTIPOLYGON (…
There are 774 observations of 6 variables in the nga file, including shapeName for the LGA that each region belongs to and geometry for the polygons, as seen using the glimpse() function of dplyr above. The geometry type is multipolygon. It is also in the wgs84 geographic CRS, just like the water point data. Hence for now, there is no need to perform st_transform() to align their CRS.
We also run a check for invalid geometries in the LGA data, using st_is_valid() of sf.
length(which(st_is_valid(nga) == FALSE))[1] 0
The output is 0 - there is no invalid geometry for the LGA polygons.
We also check for missing values in the LGA data, using is.na() of ursa to return TRUE/FALSE values and rowSums() of raster to tally the number of TRUE.
nga[rowSums(is.na(nga))!=0,]Simple feature collection with 0 features and 5 fields
Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS: WGS 84
[1] shapeName Level shapeID shapeGroup shapeType geometry
<0 rows> (or 0-length row.names)
6. Data Wrangling: Cleaning and Extracting the Necessary Data
6.1. Recoding of Missing Water Point Status Data
In the code chunk below, we use replace_na() of tidyr to replace the “NA” data in the status_cle variable with “Unknown”, as this is the variable that will be used subsequently. This is so that the observations with “NA” will not be excluded in subsequent analyses.
wp_nga <- read_rds("data/geospatial/wp_nga.rds") %>%
mutate(status_cle = replace_na(status_cle, "Unknown"))6.2. Exploratory Data Analysis (EDA)
In the code chunk below, we use freq() of funModeling to display the distribution of status_cle field in wp_nga for a quick view of the available classes and their distributions. We need to suppress this code chunk due to file size limit when we commit the changes and push to GitHub, by setting #|eval: false.
freq(data = wp_nga,
input = "status_cle")We see that there are 3 status_cle values that describe functional water points, namely
Functional(45,883, 48%),Functional but needs repair(4,579, 5%), andFunctional but not in use(1,686, 2%).
On the other hand, there are 5 values which indicate that the water points are not functional, including 7 mis-coded values due to a missing hyphen and lower “f”, and they are
Non-Functional(29,385, 31%)Non-Functional due to dry season(2,403, 3%)Abandoned/Decommissioned(234, <1%)Abandoned(175, <1%)Non functional due to dry season(7, <1%)
We see that over 1/3 of the water points are non-functional.
There are also 10,656 or 11% missing values which we recoded to Unknown using replace_na().
6.3. Extracting Water Point Data
In this section, we will extract the water point records by using the classes that we saw above in status_cle field. This will help us obtain the absolute numbers as well as allow us to calculate the % total later.
6.4. Extracting Functional Water Points
In the code chunk below, we extract the data for the functional water points into wpt_functional using filter() of dplyr for the 3 classes that we identified using freq() of funModeling.
wpt_functional <- wp_nga %>%
filter(status_cle %in%
c("Functional",
"Functional but not in use",
"Functional but needs repair"))Running freq() on wpt_functional to check, we can see that the same number of records for the 3 functional classes are captured as per in wp_nga. We similarly suppress the evaluation of the code chunk below due to file size constraint.
freq(data = wpt_functional,
input = "status_cle")6.5. Extracting Non-Functional Water Points
We repeat the above process for non-functional water points, using the code chunks below.
wpt_nonfunctional <- wp_nga %>%
filter(status_cle %in%
c("Abandoned/Decommissioned",
"Abandoned",
"Non-Functional",
"Non functional due to dry season",
"Non-Functional due to dry season"))freq(data=wpt_nonfunctional,
input = 'status_cle')6.6. Extracting Water Point with Unknown Class
Finally, for completeness, we also need to extract the water points with unknown status (missing status_cle field), using the code chunk below. Using str() of R’s utils, we confirm that the number of observations (10,656) tallies with that in the earlier frequency bar chart plotted using freq() of funModeling.
wpt_unknown <- wp_nga %>%
filter(status_cle == "Unknown")
str(wpt_unknown)6.7. Performing Point-in-Polygon Count
We want to find the number and proportion of functional, non-functional and unknown water points within each LGA. To do this, we use st_intersects() of sf to determine the cross-over between the LGA polygons in nga and water points in wp_nga. Thereafter, lengths() of Base R is used to return the number of water points in each class by LGA. Finally, we use mutate() of dplyr to add the new variables for total wpt, wpt functional, wpt non-functional and wpt unknown to nga sf data table, and assign it to a new variable nga_wp.
nga_wp <- nga %>%
mutate(`total wpt` = lengths(
st_intersects(nga, wp_nga))) %>%
mutate(`wpt functional` = lengths(
st_intersects(nga, wpt_functional))) %>%
mutate(`wpt non-functional` = lengths(
st_intersects(nga, wpt_nonfunctional))) %>%
mutate(`wpt unknown` = lengths(
st_intersects(nga, wpt_unknown)))Note that the symbol used is " ` " (backtick) and not " ' " (apostrophe). This is used when there is space and hyphen (-) in the variable name (e.g. total wpt).
Thereafter, we compute the percentage functional and percentage non-functional water points as pct_functional and pct_non-functional, using mutate() of dplyr in the code chunk below.
nga_wp <- nga_wp %>%
mutate(pct_functional = `wpt functional`/`total wpt`) %>%
mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`)6.8 Saving the Analytical Data Table
With the tidy sf data table, we save it in rds file format as nga_wp.rds for subsequent analysis, using write_rds() of readr.
write_rds(nga_wp, "data/geospatial/nga_wp.rds")Before we move on to the next section on spatial analysis, we will set #| eval: false for all code chunks that rely on either the geo_export Shapefiles or wp_nga as the files are too large and need to be deleted before committing and pushing the changes to GitHub. We will work with the geoBoundaries-NGA-ADM2 Shapefiles and nga_wp.rds file, which is only around 2.1MB in size respectively.
7. Visualising the Spatial Distribution of Water Points - Thematic Mapping
To avoid error with the removal of the large data files and suppression of the relevant code chunks above, we use read_rds() to load the nga_wp.rds file at the start of the next section of our analysis.
nga_wp <- read_rds("data/geospatial/nga_wp.rds")As we have performed st_intersects(), we can use st_transform() of sf to convert the data from an ellipsoid wgs84 CRS to a planar projected CRS via mathematical reprojection of the coordinates, prior to distance calculations. This is done using EPSG: 26392 for Minna / Nigeria Mid Belt (Spatial Reference, 2022), in the code chunk below. We also check that the transformation has been done correctly using st_geometry() of sf, where the projected CRS field indicates Minna / Nigeria Mid Belt.
nga_wp26392 <- st_transform(nga_wp,
crs = 26392)
st_geometry(nga_wp26392)Geometry set for 774 features
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 26662.71 ymin: 30523.38 xmax: 1344157 ymax: 1096029
Projected CRS: Minna / Nigeria Mid Belt
First 5 geometries:
We see that the bounding box values have changed from the decimal degree format where the minimum and maximum values of x and y were between 2.7o and 14.7o, to between 26,663m and 1,344,157m (MapTools, 2022; epsg.io, 2022).
We set tmap_mode() of tmap to “view” to activate interactive viewing mode instead of static maps, to better zoom into any of the 774 LGAs for further analysis if needed.
tmap_mode("view")In the code chunk below, we plot the wgs84 and crs = 26392 versions of the Nigerian LGA using tm_shape() of tmap, and note that the latter appears to flatten the mapping area out a little more than the former. We can also see that the plots are now in interactive mode.
nga_wp_wgs <- tm_shape(nga_wp) +
tm_borders(alpha = 0.5) +
tm_layout(main.title = "Geographic CRS",
main.title.position = "center")
nga_wp_proj <- tm_shape(nga_wp26392) +
tm_borders(alpha = 0.5) +
tm_layout(main.title = "Projected CRS",
main.title.position = "center")
tmap_arrange(nga_wp_wgs, nga_wp_proj, asp = 1, ncol = 2)7.1. Quick Plots by Equal Classification
The code chunk below uses qtm() of tmap to do a quick thematic map plot of the Nigeria LGA, coloured by the number of water points in equal classification method. The four quadrants represent (from top left in a “Z” shape) total water points, functional water points, non-functional water points and water points with unknown functionality status.
total <- qtm(nga_wp26392, "total wpt")
wp_functional <- qtm(nga_wp26392, "wpt functional")
wp_nonfunctional <- qtm(nga_wp26392, "wpt non-functional")
unknown <- qtm(nga_wp26392, "wpt unknown")
tmap_arrange(total, wp_functional, wp_nonfunctional, unknown, asp=1, ncol=2)